POC of a symlink-based code sharing approach. #53417


Closed · wants to merge 1 commit

Conversation

ashb (Member) commented Jul 16, 2025

Don't read too much into the specific paths chosen.



boring-cyborg bot added labels on Jul 16, 2025: area:API (Airflow's REST/HTTP API), area:CLI, area:DAG-processing, area:db-migrations (PRs with DB migration), area:Scheduler (including HA (high availability) scheduler), area:Triggerer
ashb (Member, Author) commented Jul 16, 2025

A possible alternative to #53149

uranusjr (Member) commented:
I wonder how imports between shared modules should work in this approach.

ashb (Member, Author) commented Jul 17, 2025

> I wonder how imports between shared modules should work in this approach.

As per Jarek here, cross-imports from shared deps would be banned:

> You could just open the logging project from the Airflow repo and work on it as if it were a completely standalone project: add and run tests etc., and make sure that it has no dependencies on other parts of the system (maintaining low cyclomatic complexity).
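Concretely, a shared library treated as a standalone project might be laid out something like this (a sketch; apart from `shared/timezones` and the task-sdk symlink shown later in this thread, the names are illustrative):

```
shared/
  timezones/
    pyproject.toml                      # standalone metadata: own deps, own tests
    src/airflow_timezones/timezone.py   # single source of truth
    tests/
task-sdk/src/airflow/sdk/timezone.py    # symlink -> ../../../../shared/timezones/src/airflow_timezones/timezone.py
```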

potiuk (Member) commented Jul 17, 2025

Yep. I will take a look tomorrow/over the weekend.

> I wonder how imports between shared modules should work in this approach.

This could be done as symlinks between shared modules, but I also think such imports should be discouraged. I think each shared module should be independent of the others, because that dependence is what adds complexity. For example, rather than reading configuration from inside the logging module, you should pass a parameter from outside that provides the config.

Generally speaking, "needing one shared module to use another shared module" is an antipattern that should be possible when really needed, but it should be avoided.

This is what cyclomatic complexity "theory" is about when turned into practice (as far as I understand it). If you look at the Breeze code, this is how the whole package is organized: there are about 30 independent modules that do not use each other, and whenever there is a need to share things, the shared things are moved into a "one level deep" module that is imported by the other modules (global constants is a good example of this).
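The "pass the config from outside" idea could be sketched roughly like this (a minimal illustration; `LogConfig` and `configure_logging` are hypothetical names, not Airflow's actual API):

```python
from dataclasses import dataclass


@dataclass
class LogConfig:
    """Hypothetical config object, constructed by the host project."""
    level: str = "INFO"
    json_output: bool = False


def configure_logging(config: LogConfig) -> str:
    """The shared logging module receives its config as a parameter
    instead of importing a configuration module itself."""
    fmt = "json" if config.json_output else "plain"
    return f"logging configured: level={config.level}, format={fmt}"


# The host project (core or task-sdk) wires its own settings in:
print(configure_logging(LogConfig(level="DEBUG", json_output=True)))
# -> logging configured: level=DEBUG, format=json
```

This keeps the shared module free of any import on a sibling shared module or on the host's settings machinery.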

ashb (Member, Author) commented Jul 18, 2025

> I haven't yet looked at what this will need to make it work for sdist packages.

Yeah, out of the box this won't make a "valid" sdist as it puts the symlink in the tar without any other changes:

```
❯ find apache_airflow_task_sdk-1.1.0 -type l
apache_airflow_task_sdk-1.1.0/src/airflow/sdk/timezone.py

~/code/airflow/airflow/dist shared-lib-via-symlinks*
❯ ls -l apache_airflow_task_sdk-1.1.0/src/airflow/sdk/timezone.py
lrwxr-xr-x ash staff 62 B Sun Feb  2 00:00:00 2020  apache_airflow_task_sdk-1.1.0/src/airflow/sdk/timezone.py ⇒ ../../../../shared/timezones/src/airflow_timezones/timezone.py
```

And that symlinked path isn't valid outside of the repo. This is the layout/contents of the sdist tar:

```
apache_airflow_task_sdk-1.1.0/dev/datamodel_code_formatter.py
apache_airflow_task_sdk-1.1.0/dev/generate_task_sdk_models.py
apache_airflow_task_sdk-1.1.0/docs/.gitignore
apache_airflow_task_sdk-1.1.0/docs/api.rst
apache_airflow_task_sdk-1.1.0/docs/concepts.rst
apache_airflow_task_sdk-1.1.0/docs/conf.py
apache_airflow_task_sdk-1.1.0/docs/dynamic-task-mapping.rst
apache_airflow_task_sdk-1.1.0/docs/examples.rst
apache_airflow_task_sdk-1.1.0/docs/index.rst
apache_airflow_task_sdk-1.1.0/docs/img/airflow-2-approach.png
apache_airflow_task_sdk-1.1.0/docs/img/airflow-2-arch.png
apache_airflow_task_sdk-1.1.0/docs/img/airflow-3-arch.png
apache_airflow_task_sdk-1.1.0/docs/img/airflow-3-task-sdk.png
apache_airflow_task_sdk-1.1.0/src/airflow/__init__.py
apache_airflow_task_sdk-1.1.0/src/airflow/sdk/__init__.py
apache_airflow_task_sdk-1.1.0/src/airflow/sdk/__init__.pyi
apache_airflow_task_sdk-1.1.0/src/airflow/sdk/exceptions.py
apache_airflow_task_sdk-1.1.0/src/airflow/sdk/log.py
apache_airflow_task_sdk-1.1.0/src/airflow/sdk/py.typed
apache_airflow_task_sdk-1.1.0/src/airflow/sdk/timezone.py
```

So I think this means that in either approach (this one or the vendoring one) we'll need a custom hatchling plugin of one form or another.

ashb (Member, Author) commented Jul 18, 2025

> Generally speaking, "needing one shared module to use another shared module" is an antipattern that should be possible when really needed, but it should be avoided.

That is an opinion I don't agree with. Quite the opposite, in fact: I think it is very likely we will want to use the "timezone" shared lib (utcnow, date parsing) across many of the other shared libs. (Kube pod to parse logs, and logging to deal with formatting, are the first two that spring to mind.)

However, "should be possible when needed" -- I don't think it is. I can't think what imports we could put in, say, the logging/structlog.py code that would work in all cases (inside core, inside task-sdk, and inside its own shared tests).


And move `airflow.utils.timezone` into a shared library as the first example
of it working.

In this change we have now settled on an approach using symlinks, but we did
explore other options (see the GH PR for discussion and previous versions,
notably one built upon the `vendoring` tool).

A lot of the reasoning and the mode of operation of this is detailed in
shared/README.md in this PR, which is why this description is so short.

Currently various places in both TaskSDK and Airflow Core use these utility
functions, and while in this specific case they are small enough that they
could just be copied and the duplication wouldn't hurt us long term, this
change shows a way in which we can have a single source of truth but have it
included automatically in built dists.

Co-authored-by: Jarek Potiuk <[email protected]>
@ashb ashb force-pushed the shared-lib-via-symlinks branch from 0f20bda to efbe2a2 Compare July 21, 2025 17:03
@ashb ashb closed this Jul 21, 2025
ashb (Member, Author) commented Jul 21, 2025

Merged into the other linked PR.
